-
Notifications
You must be signed in to change notification settings - Fork 171
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor!: metrics #2464
refactor!: metrics #2464
Conversation
iroh-blobs/Cargo.toml
Outdated
@@ -72,7 +72,7 @@ package = "iroh-quinn" | |||
version = "0.10" | |||
|
|||
[features] | |||
default = ["fs-store"] | |||
default = ["fs-store", "metrics"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this means metrics collection is on by default in blobs now, or just that the feature is available?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reworked so that dep is always available but metrics collection is not on by default.
iroh-dns-server/src/metrics.rs
Outdated
@@ -9,7 +9,7 @@ use struct_iterable::Iterable; | |||
pub struct Metrics { | |||
pub pkarr_publish_update: Counter, | |||
pub pkarr_publish_noop: Counter, | |||
pub pkarr_publish_error: Counter, | |||
// pub pkarr_publish_error: Counter, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stray comment, plz remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point of this PR was for the team to claim which ones they would like to keep as they are unused. I'm going to nuke all the ones people didn't vote to keep.
iroh-net/src/magicsock/metrics.rs
Outdated
pub rebind_calls: Counter, | ||
// pub rebind_calls: Counter, | ||
pub re_stun_calls: Counter, | ||
pub update_endpoints: Counter, | ||
|
||
// Sends (data or disco) | ||
pub send_relay_queued: Counter, | ||
pub send_relay_error_chan: Counter, | ||
pub send_relay_error_closed: Counter, | ||
pub send_relay_error_queue: Counter, | ||
// pub send_relay_queued: Counter, | ||
// pub send_relay_error_chan: Counter, | ||
// pub send_relay_error_closed: Counter, | ||
// pub send_relay_error_queue: Counter, | ||
pub send_ipv4: Counter, | ||
pub send_ipv4_error: Counter, | ||
// pub send_ipv4_error: Counter, | ||
pub send_ipv6: Counter, | ||
pub send_ipv6_error: Counter, | ||
// pub send_ipv6_error: Counter, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove commented-out code
iroh-net/src/magicsock/metrics.rs
Outdated
@@ -40,7 +40,7 @@ pub struct Metrics { | |||
pub sent_disco_ping: Counter, | |||
pub sent_disco_pong: Counter, | |||
pub sent_disco_call_me_maybe: Counter, | |||
pub recv_disco_bad_peer: Counter, | |||
// pub recv_disco_bad_peer: Counter, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove commented-out code
iroh-net/src/magicsock/metrics.rs
Outdated
@@ -49,7 +49,7 @@ pub struct Metrics { | |||
pub recv_disco_ping: Counter, | |||
pub recv_disco_pong: Counter, | |||
pub recv_disco_call_me_maybe: Counter, | |||
pub recv_disco_call_me_maybe_bad_node: Counter, | |||
// pub recv_disco_call_me_maybe_bad_node: Counter, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove
iroh-dns-server/src/metrics.rs
Outdated
@@ -9,7 +9,7 @@ use struct_iterable::Iterable; | |||
pub struct Metrics { | |||
pub pkarr_publish_update: Counter, | |||
pub pkarr_publish_noop: Counter, | |||
pub pkarr_publish_error: Counter, | |||
// pub pkarr_publish_error: Counter, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is obviously guessing at usage. but i'm surprised you want to remove this. i'm used to seeing most counters in at least a pair of _total
and _error
so you can do error ratios.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, however these are unused counters and I think those can be misleading as you expect to see something in errors yet they never increment even if there was an error.
@@ -1762,25 +1762,30 @@ impl Actor { | |||
}; | |||
|
|||
loop { | |||
inc!(Metrics, actor_tick_main); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all these _tick_
metrics are basically trace-level metrics. is this really a good idea? We have trace-level logging for this. What do you expect these to be used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we used these quite a lot in the past to detect the health of tokio loops
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So 2 reasons for this:
- In general I'd like to move away from trace logging and shift towards trace metrics. Much easier to reason with it on a chart. Though logs do have their place where explicit order is really important.
- We actually used these in development MUCH more and we're more useful to us than "product level health" metrics which you look at when deployed.
I could follow up on this and maybe introduce a split for "trace_metrics" vs "metrics" if people find these too much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The general idea is to use metrics in your dev workflow too.
d94a592
to
53bb3bd
Compare
## Description Extends several pieces: - introduces a "metric dumper" which is just a way of saying you can sample the internal metrics and write them to a CSV file (which should come in handy for 3 pieces that should come down the line; 1) CI using these to validate behavior, 2) debug dumps from 3rd parties, 3) local debugging) - along with the dumper the `doctor plot` has been extended to be able to read those dumps and also just generally improved some rough edges so it's less error prone. - node counts are here for relays. You now have a derived metric which simply counts unique daily node connections. Sample usage for the metrics dumper: `cargo run --bin iroh --all-features -- --metrics-dump-path test.metrics.csv start` Sample of the plotter: `cargo run --bin iroh --all-features -- doctor plot --timeframe 30 --interval 10 --file metrics.dump.csv magicsock_actor_tick_main_total,magicsock_actor_tick_msg_total,magicsock_actor_tick_endpoint_heartbeat_total,magicsock_actor_tick_endpoints_update_receiver_total,magicsock_actor_tick_re_stun_total` ## Breaking Changes <!-- Optional, if there are any breaking changes document them, including how to migrate older code. --> ## Notes & open questions This is merging into #2464 as part of the larger metrics refactor. ## Change checklist - [ ] Self-review. - [ ] Documentation updates following the [style guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text), if relevant. - [ ] Tests if relevant. - [ ] All breaking changes documented.
## Description Extends several pieces: - introduces a "metric dumper" which is just a way of saying you can sample the internal metrics and write them to a CSV file (which should come in handy for 3 pieces that should come down the line; 1) CI using these to validate behavior, 2) debug dumps from 3rd parties, 3) local debugging) - along with the dumper the `doctor plot` has been extended to be able to read those dumps and also just generally improved some rough edges so it's less error prone. - node counts are here for relays. You now have a derived metric which simply counts unique daily node connections. Sample usage for the metrics dumper: `cargo run --bin iroh --all-features -- --metrics-dump-path test.metrics.csv start` Sample of the plotter: `cargo run --bin iroh --all-features -- doctor plot --timeframe 30 --interval 10 --file metrics.dump.csv magicsock_actor_tick_main_total,magicsock_actor_tick_msg_total,magicsock_actor_tick_endpoint_heartbeat_total,magicsock_actor_tick_endpoints_update_receiver_total,magicsock_actor_tick_re_stun_total` ## Breaking Changes <!-- Optional, if there are any breaking changes document them, including how to migrate older code. --> ## Notes & open questions This is merging into #2464 as part of the larger metrics refactor. ## Change checklist - [ ] Self-review. - [ ] Documentation updates following the [style guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text), if relevant. - [ ] Tests if relevant. - [ ] All breaking changes documented.
a5db957
to
f4128fe
Compare
## Description - Cleans up stale counters - Introduces unique daily node counts on relays - Introduces a metrics dumper so we can create dumps of metrics and also view them in doctor plot - Introduces new trace level counters ## Breaking Changes - Introduces `metrics_dump_path` on the top level CLI parameters, which if set will make sure metrics are collected at regular intervals and written to the provided path in CSV format. ## Notes & open questions <!-- Any notes, remarks or open questions you have to make about the PR. --> ## Change checklist - [ ] Self-review. - [ ] Documentation updates following the [style guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text), if relevant. - [ ] Tests if relevant. - [ ] All breaking changes documented.
Description
Breaking Changes
metrics_dump_path
on the top level CLI parameters, which if set will make sure metrics are collected at regular intervals and written to the provided path in CSV format.Notes & open questions
Change checklist